Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3008: QoS-class resources #3004

Open
wants to merge 92 commits into
base: master
Choose a base branch
from

Conversation

marquiz
Copy link
Contributor

@marquiz marquiz commented Oct 7, 2021

  • One-line PR description:
    New KEP for adding class-based resources to CRI protocol
  • Other comments:

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 7, 2021
@marquiz
Copy link
Contributor Author

marquiz commented Oct 7, 2021

/cc @kad @haircommander

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 7, 2021
@kikisdeliveryservice
Copy link
Member

kikisdeliveryservice commented Oct 8, 2021

Hi @marquiz

All KEP PRs must have an open issue in k/enhancements (this repo). Please open an issue and fill it out completely and rename this PR to KEP-issue number but in the title of this PR and in your README.md

Thanks!

@marquiz marquiz mentioned this pull request Oct 11, 2021
4 tasks
@marquiz marquiz changed the title KEP-NNN: Class-based resources in CRI KEP-3008: Class-based resources in CRI Oct 11, 2021
@marquiz marquiz force-pushed the devel/kep-class-based-resources branch from 3c8bfb5 to ead1a96 Compare October 11, 2021 06:48
@marquiz
Copy link
Contributor Author

marquiz commented Oct 11, 2021

All KEP PRs must have an open issue in k/enhancements (this repo). Please open an issue and fill it out completely and rename this PR to KEP-issue number but in the title of this PR and in your README.md

Thanks for the guidance @kikisdeliveryservice. Done

@marquiz marquiz force-pushed the devel/kep-class-based-resources branch 3 times, most recently from ff20ea2 to 5f2e9fe Compare October 19, 2021 11:46
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2022
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 1, 2022
@marquiz
Copy link
Contributor Author

marquiz commented Feb 1, 2022

/retitle KEP-3008: Class-based resources

@k8s-ci-robot k8s-ci-robot changed the title KEP-3008: Class-based resources in CRI KEP-3008: Class-based resources Feb 1, 2022
@marquiz marquiz marked this pull request as ready for review February 11, 2022 13:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2022
@marquiz marquiz force-pushed the devel/kep-class-based-resources branch 2 times, most recently from eefa0e3 to 8786f80 Compare February 22, 2022 14:38
@marquiz marquiz force-pushed the devel/kep-class-based-resources branch from c43c5d7 to f2001a4 Compare January 9, 2024 17:00
@marquiz
Copy link
Contributor Author

marquiz commented Jan 11, 2024

I'm trying to here capture here a summary of the PR evolution and status. The PR has been reviewed by numerous people and it has so many comments/conversations that it becomes impossible for me to summarize everything in a nice way.

Bits that are marked as unresolved as of today:

  • CRI detail: whether to use/not use common type for pod and container level QoS resources, (discussion)
  • What prefixi policy (k8s.io/ vs. example.com/ vs unprefixed) to use for QoS resource names (e.g. this discussion, this discussion)
  • Naming, some discussions/comments on that:
  • Everything under "Future work" should be considered WIP or UNRESOLVED
    • Separate API resource(s) describing (e.g. this comment), listed under "Future work" as a possible extension

Some previous concerns/shortcomings that have later been addressed:

  • pod-level vs. container-level QoS (discussion)
  • "official"/well-known/canonical names of Qos resources (e.g. this discussion), addressed in the proposal under Consts section (the details are still UNRESOLVED)
  • defaults, i.e. "what do I get if I request nothing" (e.g. discussion, another discussion), covered in the proposal e.g. under "Implicit defaults" and "PodStatus" sections
  • "Exotic" use cases (e.g. this comment and response), "Possible future scenarios" section under "User stories"
  • Class capacity (e.g. comment, another comment) was added. It is now possible to limit the number of users of classes.
  • Per runtime class QoS (comment), was listed as a non-goal of this KEP

@marquiz marquiz mentioned this pull request Jan 16, 2024
60 tasks
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran out of time for this session, more later.

to other Kubernetes resource types (i.e. native resources such as `cpu` and
`memory` or extended resources) because you can assign that resource to a
particular container. However, QoS-class resources are also different from
those other resources because they are used to assign a _class identifier_,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have concrete examples of things where we want to apply QoS, but which REALLY don't feel like "resources"?

While "QoS" is somewhat jargon-y, it's pretty well understood in my experience. Alternately, if this is just about names, we could look for similar words?

"banded resources" ? It sort of implies linearness of bands

"tiered resources" ? Also implies linearity, which I don't think is necessarily true

"categorized" ?

"qualitative" ?

+ Name QoSResourceName
+ // Allowed classes.
+ Classes []string
+ // Capacity is the hard limit for usage of the class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love that this is punted with no real idea how we might solve it


#### Cluster autoscaler

The cluster autoscaler support will be extended to support QoS-class resources.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this paragraph vague and unconvincing. :)

Autoscaling has really become a requirement for many users, and this text doesn't explain to me how the autoscalers (we now have more than one!) are able to reason about "if I make a new node of shape X, it will satisfy this pod". We left this for "later" in other efforts and it really bit us in the butt.

If you think you have an answer, can you please flesh this out? If you don't think you have an answer, we need to think about it.

@johnbelamaric @mwielgus

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point @thockin (especially in the light of the recent DRA activities). I admit that this is vague and lacking details.

I will get back to this. My plan is now to have a PoC of a cluster autoscaler to have confidence that it can be done and write out the details here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it wasn't clear, my comment was not meant to be rude.

vague_and_unconvincing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin no offence taken 😅

I took me a while to get back to this as I was not really familiar with the cluster autoscaler details. I wanted to try it out in practice and I did do a PoC implementation based on a kind cluster with CAPD deployed and running cluster-autoscaler against that. Scaling up node groups with one (or more) existing nodes worked pretty much out-of-the-box (after building cluster-autoscaler against my PoC kubernetes scheduler implementation). Because all of the scheduling logic (of QoS resources) is in k/k scheduler itself, the autoscaler correctly runs simulations and can determine how many nodes in what nodegroups need to be scaled up (based on the QoS resources available in existing nodes), also providing log messages why certain node group isn't suitable. What is missing is the provider-specific mechanism to inform the cluster autoscaler about QoS-resources of empty node groups but I wouldn't expect that to be an insurmountable problem (follow along the lines what is done for extended resources). I updated the KEP accordingly. WDYT?

Why should this KEP _not_ be implemented?
-->

## Alternatives
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this not be done with existing device plugins framework and CDI https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md#cdi-json-specification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea I cannot think how CDI could satisfy all the requirements here 🤔 First there's the Kubernetes API (for which CDI brings no avail). Then, complicating the device plugin API for the QoS-resources hasn't even been in considered tbh (instead, we try to make kubelet's life easier)

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marquiz
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@marquiz
Copy link
Contributor Author

marquiz commented Feb 1, 2024

Updated details on cluster autoscaler

@liggitt liggitt removed this from the v1.30 milestone Mar 14, 2024
- bump versions in kep.yaml
- fix typos
- added a use case for per-container OOM kill behavior
@marquiz
Copy link
Contributor Author

marquiz commented Apr 10, 2024

Small update, added per-container OOM kill behavior as a use case. There is a growing list of use cases for Kubernetes-managed QoS resources with links to ongoing efforts (e.g. swap and OOM kill)

@mkarrmann
Copy link

@marquiz As the person who's been most actively pushing for the OOM config option and just found about this KEP, I strongly agree that this is a great fit! This design is seems awesome and is a really nice generalization of the specific problem I was hoping to have solved!

container whose memory limit is higher than the requests, but that is treated
by the node as `Guaranteed`.

Taking this idea further, QoS-class resources could also make it possible to
Copy link
Contributor

@kannon92 kannon92 Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe kubernetes/kubernetes#78848 is a good tangigle use case to mention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kannon92 for the reference, I'll add it to the proposal. This is already speculated in Splitting Pod QoS Class section but I wasn't aware of this open issue.

QoS-class resource. Likely benefits of using the QoS-class resources mechanism
would be to be able to set per-namespace defaults with LimitRanges and allow
permission-control to high-priority classes with ResourceQuotas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea is smarter oom setups (systemd-oomd for example).

systemd-oomd allows one to set settings for oom (psi based pressure and swap usage) on a cgroup slice. I could envision a future where one can create a QoS and that gives knobs to control systemd-oomd for more aggressive oom.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://fedoraproject.org/wiki/Changes/EnableSystemdOomd

I was playing around with this and I explored the ideas of setting different knobs based on the existing kubepods slices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. So you think a about pre-defined oom kill classes (in contrast to e.g. exposing every possible systemd-oomd knob to the user). At first thought, this sounds like a good fit. I'll add this to the proposal too, as a possible future use case.

that kubelet could evict a running pod that request QoS-class resources that
are no more available on the node. This should be relatively straightforward to
implement as kubelet knows what QoS-class resources are available on the node
and also monitors all running pods.
Copy link
Contributor

@kannon92 kannon92 Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some areas that eviction manager is starting to show its age.

  • If one wants to add more disks to kubelet, we lose the ability for eviction manager to monitor it.
  • PSI based eviction
  • Swap based eviction
  • Moving things out of root partition (logs come in mind)

I've been thinking that eventually we are going to need some kind of pluggable eviction manager based on certain resources. It should be general but I haven't really made much progress on it. wanted to throw this out there as it may be worth considering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kannon92 thanks for the idea. Do you have any PoC implementation on this? If you have, it would be nice to wire it up to my code and see how it works.


<!-- References -->
[intel-rdt]: https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
[linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This URL now 404s

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New link:

Suggested change
[linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html
[linux-resctrl]: https://docs.kernel.org/arch/x86/resctrl.html

@Madhu-1
Copy link

Madhu-1 commented Jun 26, 2024

Hey @marquiz, it's great to see that this is making progress! Do you have any updates on the timeline for when this version is planned to be released? The KEP indicates it's planned for 1.31, but I was wondering if it might end up being pushed to 1.32 or later instead.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2024
@kad
Copy link
Member

kad commented Sep 24, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2024
@kad
Copy link
Member

kad commented Dec 24, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Assigned
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.